Directed Networks
Conformal Bayes for Two-Sided Censored Gaussian Regression under Label Shift
Prediction under label shift becomes nonstandard when responses are censored. In a two-sided censored Gaussian model, latent values below $L$ and above $U$ are recorded at the boundary values, so the observed predictive distribution is mixed, with atoms at $L$ and $U$ and a continuous density on $(L,U)$. In this paper we develop conformal Bayes for this mixed-space setting by combining posterior predictive tilting with weighted conformal calibration. Under a two-sided Tobit Gaussian Bayesian prediction head with a Laplace posterior approximation, the tilted predictive distribution has left-atom, interior, and right-atom components, with a three-term closed-form normalizer. The resulting prediction set is a mixed highest density region that can combine boundary atoms with an interior interval and can reduce to atom-only sets under strong censoring. The main technical issue is that latent label shift does not directly give an ordinary density ratio on the observed censored scale. A latent exponential tilt induces tail-averaged atom weights at the censored boundaries, while the interior ratio remains density based. This yields a mixed observed-space calibration weight with two atom ratios and one interior density ratio. The weight corrects the calibration measure, while predictive tilting gives target-adapted mixed-HDR geometry. Synthetic experiments show that weighted tilted conformal Bayes restores marginal coverage with smaller sets than weighted source-score calibration, while revealing a trade-off between marginal coverage and component-wise behavior across atoms and interior observations.
Bayesian Best-Arm Identification with Abstention: A Polynomial-to-Exponential Phase Transition
Huang, Yuqi, Hou, Yunlong, Tan, Vincent Y. F.
We study the Bayesian fixed-budget best-arm identification problem in which a learner can abstain from making a terminal recommendation. Subject to an abstention budget $α$, we analyze the probability of undetected error--the risk of recommending a suboptimal arm without abstaining. Our central finding is that abstention induces a phase transition: without abstention, the error probability decays polynomially in the sampling budget $T$; in contrast, introducing any small positive abstention budget shifts this to an exponential decay. For Gaussian priors and rewards, in the regime $T\to\infty$ followed by $α\downarrow0$, we establish exact matching information-theoretic lower bounds and algorithmic upper bounds on the optimal error exponent, which takes the form $\exp(-\frac{α^{2}T}{8κ_ν^{2}})$. The hardness parameter $κ_ν$ represents the prior density of the top-two gap at zero, highlighting that nearly tied instances drive the fundamental error. We introduce an adaptive algorithm, PGWS, that successfully achieves this optimal exponent by expending its abstention budget on statistically ambiguous instances. We further demonstrate that this polynomial-to-exponential improvement is exclusively a Bayesian phenomenon--in the frequentist setting, abstention only affects lower-order exponent terms. We also extend our results beyond the Gaussian model.
A Mathematical Optimization Approach for Expert-Informed Bayesian Best Subset Selection
Alexander, Nolan, Mortveit, Henning
A central challenge in statistical modeling is identifying the subset of features that belong in the true regression model. The classical best subset selection problem, recently made tractable via mixed-integer optimization (MIO), finds the globally optimal sparse solution. It does not, however, make use of any information beyond the observed data. In many applied settings, domain experts can meaningfully rank or score the relevance of candidate predictors, yet no existing framework integrates such probabilistic expert assessments directly into the best-subsets objective. This paper presents Expert-Implied Bayesian Best Subsets (EBBS), a method that incorporates domain-expert probability estimates of feature relevance into the MIO best-subsets problem through a maximum a posteriori (MAP) framework. Expert views from multiple respondents are aggregated into a single prior probability per feature using the Poisson binomial distribution for marginal probability estimates, the pairwise win rate for pairwise comparisons, or the normalized mean rank for ordinal rankings. This probability enters the objective function as a log-odds penalty term that smoothly encourages or discourages the selection of each feature consistent with the expert consensus. This paper provides analytic derivations of the MAP formulation and characterizes its theoretical properties. The proposed model reduces to Best Subsets when experts all have no views. Empirical results on synthetic and real datasets are forthcoming.
Perspectives on Latent Factor Indeterminacy and its Implications for Data Representation
The common factor analytic model is related to Helmholtz and Boltzmann machines, can be conceived as a linear autoencoder, or can be thought of as a single-hidden-layer generative neural network. We thus consider it a basal generative representation learner that can be used as a minimal model for studying the foundational characteristics of (deep) generative model architectures. We focus on the fundamental problem of indeterminacy in latent factor projections. This indeterminacy implies that, even when the intrinsic dimension of the latent vector is known, regularity conditions are met, and rotational indeterminacy is resolved, an inherent indefiniteness in the retrieval of causative latent sources remains: they will be uncertain, distributionally deviant, and non-unique. This can have major implications for data representation but remains an elusive issue, even to practitioners and theorists well-versed in the factor model. Moreover, this classic psychometric problem is intricately related to the modern issue of latent variable collapse in the variational autoencoder framework for deep generative modeling. Here, we assess this indeterminacy from various perspectives and show how these are mathematically and conceptually related and we discuss subsequent implications for the Psychometrics, Statistics, and Artificial Intelligence communities. We show that one has latent factor determinacy across all its facets when the feature-dimension grows to infinity. This feeds into an essentially distribution-free estimation approach in the sample case when the number of features grows very large. We conclude, as these are emergent properties at scale, that the factor model is suited for representation learning of very-high-dimensional data.
XMSE-Aware Adaptive Empirical Bayes Estimation
Empirical Bayes (EB) estimators can match the first-order asymptotic risk of maximum likelihood (ML) while behaving very differently at second order: recent excess mean squared error (XMSE) analysis shows that kernel-based EB estimation may be worse than ML when the kernel is poorly aligned with the true parameter. This paper turns that diagnostic into a design principle. We propose an XMSE-aware mixed estimator that interpolates between ML and EB shrinkage. Its fixed-weight XMSE is a scalar quadratic, yielding a closed-form oracle mixing weight that is no worse than both ML and the base EB estimator at the XMSE scale. A plug-in implementation based on finite-sample XMSE approximations is proved consistent, with a second-order oracle regret rate for an interior oracle weight. We further establish a transfer of the regret bound to the fixed-weight risk curve evaluated at the selected weight, a thresholded boundary rule, and extensions to compact kernel families and to finite and growing kernel dictionaries with high-probability oracle bounds. Finite impulse response simulations with SURE-tuned, hard-selection, and trace-corrected baselines, together with the public Silverbox and Cascaded Tanks benchmarks, show that the proposed estimator retains most of the benefit of regularization when it is helpful and retreats toward ML under kernel misspecification, with an identified finite-de analyzed on the benchmarks.
Learning Probabilistic Filters with Strictly Proper Scoring Rules
Bach, Eviatar, Baptista, Ricardo, Bröcker, Jochen, Chen, Bohan, Stuart, Andrew
Bayesian filtering of partially and noisily observed dynamical systems seeks to infer the evolving conditional distribution of the state of a dynamical system, given observations, in an online fashion. This Bayesian filtering distribution is the natural object for uncertainty quantification, but it is rarely available as a supervised learning target. However, one can often use the forecast model to generate synthetic system trajectories, along with synthetic observations. We introduce the proper scoring ensemble filter (PSEF), an ensemble data assimilation method based on training an analysis map to approximate the filtering distribution using only synthetic state--observation trajectories. The analysis step is represented as a permutation-invariant, transformer-based map that takes as input a forecast ensemble and observations, producing an analysis ensemble. Training is based on strictly proper scoring rules -- with the energy score used in our implementation -- so that probabilistic accuracy is rewarded over the whole probability distribution. We prove that, under a realizability assumption, the population objective is minimized by the true Bayesian filtering distribution. We also derive the finite-ensemble empirical objective used in training and relate its single state--observation trajectory form to the population objective, using a mean-field consistency argument. Numerical experiments show that the learned filter accurately approximates challenging filtering distributions, including nonlinear, non-Gaussian, and multi-modal posteriors, and achieves stronger performance in data assimilation tasks than classical methods or learning-based methods with mean-squared-error objectives. For close-to-Gaussian problems, learning a correction to the EnKF is the best approach, while for highly non-Gaussian problems an end-to-end approach that discards this inductive bias is superior.
Statistical and Structural Approaches to Algorithmic Fairness
Modern machine learning systems have outgrown their origins as isolated predictive constructs, evolving into complex socio-technical architectures that actively mediate human opportunity. As algorithms increasingly determine access to economic and social opportunities, it has become widely recognized that these systems are deeply embedded with the structural inequalities and prejudices of their environments. The field of algorithmic fairness emerged in response to the growing recognition that models optimized for predictive accuracy can systematically disadvantage marginalized groups. Early mitigation strategies, however, rested on fragile simplifications that limited their effectiveness in complex sociotechnical environments. This thesis identifies and addresses two fundamental limitations of contemporary fairness paradigms: the reliance on deterministic point estimates for auditing and the treatment of individuals as isolated entities devoid of structural context. First, the diagnosis of algorithmic unfairness has traditionally depended on scalar metrics that fail to capture the nuances of real-world deployment. This deterministic approach ignores the high statistical variance inherent in small, intersectional groups, often leading to false alarms or missed detections of bias. Furthermore, standard auditing struggles with the opacity of black-box models, frequently conflating unjustifiable bias with the influence of legitimate features.
A probabilistic framework for online test-time adaptation
Corrales, Daniel, Insua, David Ríos
This paper presents a probabilistic framework for online test-time adaptation problems. In them, a model is trained on labeled data but must adapt to unlabeled data at test time under the assumption that training and test distributions potentially differ, that is, there might have been a distributional shift. The framework is based on a state-space modelling architecture from which parameter learning, parameter time evolution, prior tuning, and prediction can be characterized.
Information from coincidences
We prove a single algebraic mixed coincidence identity that unifies a broad swath of information-theoretic variational results. For any family of priors $\{π_i\}$ and real exponents $\{ α_i \}$, the log of the mixed count $E_{x\simν}\!\left[\prod_{i=1}^W π_i^{α_i}(x)\right]$ is simultaneously a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. The identity yields a unified derivation of classical cornerstones of information theory: concentration of empirical distributions (Sanov-type decompositions and Gibbs conditioning), hypothesis-testing error exponents (Chernoff information and its multi-way analogue), change-of-measure inequalities (Donsker-Varadhan and PAC-Bayes), and laws governing rare-pattern coincidences (Erdos-Renyi run-length, iterative guesswork, rate-distortion, and birthday thresholds). Each is recovered as a specialization of the same algebraic equality. It strictly generalizes the classical Renyi entropy and divergence variational formulas (one and two priors respectively) to a $W$-prior simplex, and holds for unnormalized and continuum-indexed priors. Among its consequences are an exact multi-prior PAC-Bayes penalty that subtracts an explicit "coincidence bonus" from the usual single-prior posterior penalty, and the asymptotic MAP error exponent for $W$-ary hypothesis testing as an edge-restricted simplex optimum. We demonstrate the calculus at scale on two large alphabets encoding richly modeled sequential languages: on language-model next-token predictives where we recover contrastive decoding, and on human genomic regulatory sequence where it separates correlated from diverse prior families along a sliding-window trace.
Hierarchical Partial-Order Models for Ranking
Li, Dongqing, Nicholls, Geoff K., Lee, Jeong Eun, Chuxuan, null, Jiang, null
Rank aggregation combines information from ordered lists ranking items by preference. Classical parametric models for such data, including the Mallows and Plackett-Luce models, assume the orders concentrate around one or more complete consensus rankings. Recent work relaxes the total-order assumption by allowing the consensus structure to be a partial order (poset), allowing for incomparabilities in preferences. However, in many applications preference data exhibit group structure. We introduce hierarchical partial order (HPO) models, which extend poset-based models to accommodate grouped data through a hierarchy of latent posets. This framework, which parallels mixture model extensions of the Mallows and Plackett-Luce models, enables principled sharing of information across groups while preserving partial-order structure. We show that the Plackett-Luce model and its hierarchical variants are special cases of HPO-models. We develop a hierarchical clustering extension (HCPO) for unsupervised clustering in settings where group labels are unknown. Bayesian inference for the latent poset hierarchy is performed using Markov chain Monte Carlo methods. Experiments on synthetic and real-world datasets, including pairwise acoustic preference data and LLM agent traces, demonstrate that the proposed HPO and HCPO models outperform existing approaches in both predictive performance and structural interpretability.